Counting animals in aerial images with a density map estimation model

Abstract Animal abundance estimation is increasingly based on drone or aerial survey photography. Manual postprocessing has been used extensively; however, volumes of such data are increasing, necessitating some level of automation, either for complete counting, or as a labour‐saving tool. Any automated processing can be challenging when using such tools on species that nest in close formation such as Pygoscelis penguins. We present here a customized CNN‐based density map estimation method for counting of penguins from low‐resolution aerial photography. Our model, an indirect regression algorithm, performed significantly better in terms of counting accuracy than standard detection algorithm (Faster‐RCNN) when counting small objects from low‐resolution images and gave an error rate of only 0.8 percent. Density map estimation methods as demonstrated here can vastly improve our ability to count animals in tight aggregations and demonstrably improve monitoring efforts from aerial imagery.


| INTRODUC TI ON
Aerial imagery has become the principal surveying method for many animal populations (Butler & Muller-Schwarze, 1977;Fraser et al., 1999;Trathan, 2004;Trathan et al., 2012). Such methods are favored since they can quickly and efficiently survey large remote areas with the help of either manned fixed-wing vehicles/helicopters or unmanned aerial vehicles (UAVs), although the conditions associated with each platform type may dictate subsequent image processing. In the past decades, many such ecological surveys have been conducted (Burn et al., 2006;Chabot et al., 2018;Descamps et al., 2011;Dunstan et al., 2020;Groom et al., 2013;Lee et al., 2019;Vermeulen et al., 2013). While this is a very efficient way to collect large amounts of data, it may create a large postprocessing burden that is frequently borne by humans-typically consisting of laborious manual scanning of photos or videos to locate, identify, and count individual animals (Torney et al., 2016). Volume aside, this can be a challenging task due variously to small object sizes, almost indistinguishable fore/background pixels, and varying illuminations (see Figure 1).
To alleviate these problems, there has been extensive work to integrate computer-based image processing to assist in, or fully automate, abundance estimation. Early works were mostly based on either spectral thresholding approaches or template matching approaches to count targets of interest in imagery (Chabot & Bird, 2012;Chabot & Francis, 2016;Christiansen et al., 2014).
However, those methods are often vulnerable to complex situations such as heterogeneous backgrounds (Chabot & Francis, 2016). Hurford (2017) introduced image processing software, ImageJ to assist in counting birds, but its practical use was limited since the software cannot handle complex environments, which is common in various ecological contexts. To alleviate this problem, Marchowski (2021) preprocessed images with a denoising neural network (Buchholz et al., 2020) before counting with ImageJ, which then makes the counting accuracy highly dependent upon the performance of the denoising process. Object-based image analysis (OBIA) has also been popular in previous ecological field studies (Afán et al., 2018;Chrétien et al., 2016;Lhoest et al., 2015;Rush et al., 2018). This method first relies on some handcrafted features to group pixels into objects and then classifies objects according to properties such as shape and size, but its counting performance may suffer when animals are obscured or spatial resolution is low (Afán et al., 2018;Chrétien et al., 2016). Hodgson et al. (2016) and Hodgson et al. (2018) offer some other examples of computerassisted animal counting, where a combination of Fourier analysis and support vector machines are used to exclude background pixels, making the subsequent manual counting of animals easier. For fully automated estimation of animal numbers, convolutional neural networks (CNNs) are commonly adopted, which are a type of deep learning neural network with components particularly directed toward images. Their use in image processing has been transformative, with robustness proved in classification, detection, and segmentation (Simonyan & Zisserman, 2015).
Automated counting of animals within images usually involves the location, and subsequent classification, of objects within a frame. In terms of CNNs, this gives rise to two broad approaches: one-and two-stage algorithms. Two-stage algorithms first propose bounding boxes for locations where objects are likely to exist and then do the classification, where region-based convolutional neural network (RCNNs, Girshick et al., 2014) and Faster-RCNN  are representative examples. One-stage methods such as You Only Look Once (YOLO; Redmon et al., 2016) and Single Shot Multibox Detector (SSD; Liu et al., 2016) process these two tasks simultaneously. In general, one-stage methods have the advantage of computing speed while two-stage methods have better accuracy.
Both methods have been adopted for analyzing aerial images collected in ecological studies. Torney et al. (2019) built a YOLO v3 (Redmon & Farhadi, 2018) model to detect wildebeest in aerial images, which displayed accuracy similar to manual processing while being quick to compute. Later, more studies applied YOLObased methods to detect their own target species in drone footage (Corcoran et al., 2019;Desai et al., 2022;Gorkin et al., 2020;Hamilton et al., 2020). Another one-shot object detector, RetinaNet (Lin et al., 2020) was used in an attempt to build a general model for bird detection (Weinstein et al., 2022). Kellenberger et al. (2017) used a Faster-RCNN model to detect different animals in UAV images surveyed in Kuzikus Wildlife Reserve park. Additionally, the two-stage Faster-RCNN model has also been used to detect koalas (Hamilton et al., 2020), kiang (Peng et al., 2020), and large herbivores (Ma et al., 2022) in aerial images. Hong et al. (2019) compared the performance of different deep learning-based detection methods (Faster-RCNN, SSD, YOLO, RetinaNet) on a UAV aerial image dataset of wild birds and showed the potential of these techniques in monitoring wild animals. Their study pointed out that the two-stage method Faster-RCNN performs the best among all these detection methods with regard to counting accuracy.
Recently, Hoekendijk et al. (2021) proposed a deep CNN model to regress the count objects of interest in the image. Their model is composed of a ResNet  and two fully connected layers. Although showing good performance, their model has a size limit on the input images, which means for a large image, it has to be cropped to a required patch size before passing into the model. This may result in issues such as replicated counts across the boundary of these image patches. Also, their results show the model only F I G U R E 1 Selected data samples with (a) small object size, (b) almost indistinguishable fore/background objects, and (c) varying illuminations are shown. The study object, penguins, is marked with red dots. performs well up to a certain count level-when the count is out of this scope, the model exhibits poor performance.
Here, we adopt a fundamentally different method for counting animals, where the detection of individual animals is avoided, with focus being the estimation of a density map-a concept initially introduced by Lempitsky and Zisserman (2010). Estimated counts are instead obtained by the subsequent integration of this density map, rather than explicit counting of objects. The density map approach has been further integrated into the deep learning framework and widely applied in crowd counting Ma et al., 2019Ma et al., , 2021Qian et al., 2022), where crowds are usually humans.
In this work, we provide a solution to counting animals of low res-

| Density map generation
Our objective was to estimate the number of penguins in an image, here approached by density map estimation. The density maps are an intermediate representation generated from point annotations, with the integration of any region on these maps providing the count of target objects. The generation process is detailed here.
Given an image I with pixels M and a set of 2D annotated points P = p 1 , p 2 , … , p n , its ground-truth density map D gt can be obtained by where I m denotes a two-dimensional pixel location, m = 1, 2, … M and  I m ; p n , 2 n represents the n th annotated two-dimensional Gaussian distribution, p n is the coordinate of n th point annotation, and 2 n indicates the isotropic variance. The setting of 2 n is flexible and often dataset dependent. It can be either fixed (Lempitsky & Zisserman, 2010) or adaptive (distance to nearest neighbours; Zhang et al., 2016). When using the kernel with fixed bandwidth, we are assuming objects are independently distributed on the image plane, while the adaptive bandwidth is normally used to characterize the geometry distortion led by the perspective effect.
The choice of 2 n is crucial for generating density maps, and using an improperly generated density map as a learning target may compromise the model's counting performance (Wan & Chan, 2019).
Ideally, the pixels with density values should reflect consistent features, which in our case means only pixels belonging to a penguin (1) D gt I m = N ∑ n=1  I m ; p n , 2 n will have density values. However, this is hard to achieve, given the typical size of a penguin is only about 5 × 5 pixels, while using a very small Gaussian kernel will lead to a very unbalanced sparse matrix with most values of 0, and will make the network hard to train . To achieve the balance, our generation method is given as follows: given the penguins are almost identical in size and shape in aerial images, the Gaussian kernel with fixed bandwidth is applied to the center point of each penguin and the value of σ is set as 4.
An example of these density maps is given in Figure 2. Although we don't give the location of each penguin, these density maps still retain some location information, which can indicate the region where the penguin may exist.

| Model structure
The overall model structure is shown in Figure 3. It is a simple structure with only a backbone network and two branches. Since VGG-19 (Simonyan & Zisserman, 2015) has good performance in most computer vision tasks, such as detection and classification, and consumes relatively few computing resources, we adopt it as the backbone.
However, VGG-19 learns salient features by gradually downsampling the feature maps. To maintain high resolution of the output density map, we remove its last max pooling layer and all subsequent layers.
Additionally, an upsampling layer is added to keep the final size of the output at 1/8 of the original input. Here, bilinear interpolation is used as the upsampling method.
The models are designed to process two tasks: density map estimation and segmentation. Density map estimation can be seen as a two-step problem by nature, first the location of regions that contains objects of interest and then regress the density values. Second, segmentation is to classify if a pixel belongs to the object of interest. These two tasks are interrelated and can assist the backbone to learn robust intermediate features for each other. Further, the segmentation result is used to guide the density regression. Specifically, to prevent background features from misleading the regressor, the weights of these features are reduced before being fed into the regressor. To achieve this, we generate a mask M d based on the predicted segmentation map: where is the dampening factor and 1 is the indicator function. We set as 0.1 and the generated mask M d is then applied on the intermediate features by point-wise multiplication.
We down-sample the D gt by aggregating the density values to match the output size. The resulted learning target D target is further used in the generation of the ground-truth segmentation map (S gt ): where ϵ is a density threshold and is set as 1 × 10 −3 here.

Density branch & segmentation branch
The two branches in the model share a similar structure. They both consist of three convolutional layers: The first two have a kernel size of 3, while the last one has a kernel size of 1. These layers gradually reduce the number of channels of the extracted features from 512 to 1. The rectified linear unit (ReLU; Zeiler et al., 2013) is used as the activation function for the first two layers, with the activation function for the last layer of the two branches being different. The density branch is activated with the ReLU function to make sure every point on the output is non-negative, whereas for the segmentation branch, the sigmoid (Han & Moraga, 1995) function is used to limit the range between 0 and 1.

| Loss function
Our overall loss function consists of two parts. First, we adopt the structural loss (SL) proposed by Rong and Li (2021) to supervise the density branch, defined as: (2) M d = 1 S pred ≥ 0.5 + 1 S pred < 0.5 ,

F I G U R E 2
Left is a random image (penguins are labeled with red dots) picked from the dataset and its corresponding density map is on the right.
where D pred represents the predicted density map, and Pool stands for average pooling which downsamples the map by a factor of 1 2 i−1 . SSIM is short for the Structural Similarity Index Measures (Wang et al., 2004) that can describe the similarity of two images, expressed as: where μ and σ denote mean and variance while XY represents the covariance of X and Y. C 1 and C 2 are constants, set to 0.01 and 0.03 by default. The higher the SSIM index, the more similar the two images are. N is set as 3 following Wang et al.'s work.
The SL function improves the structural similarity between the prediction and the target by SSIM of high-resolution maps, and the count accuracy is ensured by SSIM of the pooled density maps. The final loss function is a weighted sum of the above two loss functions: with λ set to 0.1 since the density estimation is the main task of the model.

| Model inference
Our model adopts a fully convolutional design, which means it has no strict size constraints on the input image. However, there are four max-pooling layers with kernel size of 2 in the backbone structure, which may result in pixel dropout. To prevent this, the input image has to be enlarged to the smallest size divisible by 16. The output density map D out integrates the predictions from both branches and can be obtained by:

| Experiments
We randomly split our dataset into three parts in a ratio of 3:1:1.
The largest part serves as the training set, and the remaining parts are used for the purpose of validation and test, respectively. The detailed statistics of these three datasets are shown in Table 1.
Notably, these datasets show drastic change in density distribution and all contain a few samples that are only backgrounds.
In our experiments, we adopt random cropping (256 × 256) and random horizontal flipping as data augmentation strategies for training the model. The parameters of the backbone are initialized with the VGG-19 pretrained on ImageNet (Deng et al., 2009) and others are randomly initialized from a Gaussian distribution with a standard deviation of 0.01. We train the network for 600 epochs with a batch size of 16 using the Adam optimizer (Kingma & Ba, 2015).
We fix the learning rate as 1e−5 and the weight decay as 1e−4, with validation starting after the 100th epoch. The model with the best performance on the validation set is used to report the final result on the test set.
For comparison, we also implement a Faster-RCNN model, the detailed training process is provided in the Appendix A.
All experiments were conducted on a single 16 GB Tesla P100 GPU, with methods implemented with Pytorch. The whole training process takes approximately 3 h.

| RE SULTS
To evaluate our method, we use the mean absolute error (MAE) and root mean squared error (RMSE) metrics, defined as: where N is the total number of the images, C pred i and C gt i is the predicted count and the ground-truth count of i-th image, respectively.
Mean absolute error gives the average error between predicted and target values, which can provide direct evidence of the accuracy of a model. However, it is not sufficiently sensitive to undesirable large errors. Therefore, RMSE is used as another evaluation metric since RMSE gives a relatively high weight to large errors.
MAE and RMSE can be used jointly to diagnose the performance of a model. The larger the difference between them means the greater the variance in the individual errors in a dataset. It is unclear which one of these metrics is more important; hence, we simply define the model which has the lowest sum of MAE and RMSE on the validation data as the best model. This model's performance on the test set is shown in Table 2. To better illustrate our model's performance, we provide the results from a Faster-RCNN model for comparison.
Some of the estimated density maps are presented in Figure 4.
Although the prediction's resolution is only one-eighth the resolution of the generated ground-true density map, it exhibits similar characteristics at the image level.

| DISCUSS ION
The algorithmic counting of objects in aerial images in ecological studies is currently dominated by detection algorithms. However, we have shown here that our model has various advantages over these methods.
Overall, our model has four main advantages over detection methods beyond markedly better performance on our data. First, our method is able to count extremely small objects. In the case of aerial images, the object of interest in an image is likely to be very small, especially for ecological surveys-in our case, only about 5 × 5 pixels. Our experiments show even the two-stage detection algorithm Faster-RCNN fails to detect most of the penguins. The reason is as follows: No matter what detection methods, a backbone structure is essential for extracting features. However, the current mainstream deep network structure, often used as the backbone, will downsample the image to a certain extent, for example, the downsampling ratio of VGG series is 16, while 32 for ResNet series . With a high downsampling ratio, the representation of a small object on the final feature maps may not be abundant enough for subsequent neural networks to predict the location and classification simultaneously. In contrast, our density estimation model only focuses on the counting of locations on the feature map instead of individuals, which provides better count accuracy. Second, our model only requires point annotation, which means annotators need only to mark the same part of each object with a dot, quite similar to the way human counts. In contrast, detection algorithms require bounding box annotations and the quality of these will have a large impact on their performance (Russakovsky et al., 2015). For each object, a high-quality bounding box is characterized as one with the smallest area but covers all the visible parts of that object. To create such annotation is laborious compared with point annotation, which only requires a point be drawn on the object and need not be especially accurate. This simplicity is exemplified in the generation of the ground-truth density maps-the size of each penguin is not required, we need only apply the same normalized Gaussian kernel on every penguin.
Third, the density map estimation method can better handle objects located at the edge of the image. It is often the case that the images are of large size, with considering GPU memory

F I G U R E 4 Some visualization results
of the estimated density maps. The three images in each row, from left to right is the input, the Gaussian-smoothed groundtruth density map and the prediction. The corresponding count is given in the lower right corner of the density map. The difference between the ground-truth and the estimated counts is highlighted.

CO N FLI C T O F I NTE R E S T S TATE M E NT
The authors declare no competing interests.

DATA AVA I L A B I L I T Y S TAT E M E N T
The data and the code used in this work are available at the Dryad Data and can be accessed via https://datadryad.org/stash/dataset/ doi:10.5061/dryad.8931zcrv8. Qian et al., 2022).

A.1 | Specification of the Faster-RCNN model
In this section, we provide the detailed training process of the Faster-RCNN model and display full evaluation results.

A.2 | Experiments
Faster-RCNN has many hyper-parameters, in our experiments, most of them are kept in consistent with the original work )-we only highlight the differences here. The input images are enlarged by four times to ensure every object is larger than 16 × 16 pixels and detectable. The number of anchor boxes is reduced from 9 to 6 since the small variation in the size of objects. The size of these boxes are set as much as possible to match the size of the objects in the dataset-specifically the area of the anchor boxes are 16 × 16 and 24 × 24 with aspect ratios of 0.9, 1, 1.1.
For fair comparison, we use the pretrained VGG-19 as backbone in the Faster-RCNN model. The data augmentation technique used here is only random horizontal flipping. The batch size is set as 1. A total of 25 epoch are trained with the stochastic gradient descent (SGD) optimizer (Ruder, 2017). The initial learning rate is set as 1e−3 and decays to 1e−4 at the twelfth epoch. Only images without negative samples are used for training, and the whole training process takes around 5 h to complete.

A.3 | Results
Different to the density map estimation method, we define the model with the highest 11-points interpolated average precision (AP) score on the validation set as the best detection model. The AP is computed from 11 recall levels and can be expressed as: where p is the precision and r is the recall. p interp (r) represents the maximum precision for any recall r ′ ≥ r. The definition of precision and recall is given below: where TP is the true positive, FP is the false positive, and FN is the false negative.
Here, we report the results of the best detection model on the test set. The detection results are varied with the intersection over union (IoU) threshold and the confidence threshold. The IoU threshold is fixed as 0.3 in our experiments, and we only adjust the confidence threshold. For better understanding, we still present the performance of the detection model with MAE and RMSE metrics and the full results are displayed in Table A1.
We visualize some detection results in Figure A1

F I G U R E A 1
Visual examples of the Faster-RCNN model when setting the IoU threshold to 0.3 and the confidence threshold to 0.2. In the left image, the ground-truth bounding boxes are red, and the prediction boxes are green. The corresponding detailed statistic is given on the right.